1 Introduction

“I never had a problem reaching a decision based on imperfect information. That’s just the way the world works.” ― Alex Ferguson, Leading: Learning from Life and My Years at Manchester United1

Legendary football manager Sir Alex Ferguson surely had no such problem while managing Manchester United from 1986 to 2013. Other clubs trying to win the English Premier League, however, may have to rely on other insights to achieve this goal.

Principal component analysis can be one of the keys to finding those insights.

Abstract

  1. Problem
  • What makes a team win?
  • In what way are some teams better than others?
  2. Dataset description
  3. Method to use
  • Principal component analysis (PCA). The reasons for using PCA can be summarized as follows:
    1. The main idea is to capture as much information as possible while reducing the complexity of the problem for easier analysis. Each principal component (PC) is a linear combination of the original variables. The loadings show the relative importance of each variable within each PC.
    2. A meaningful 2D plot can be made instead of handling high-dimensional plots.
    3. There is no need to define a dependent variable or make assumptions about the underlying distribution of the variables, which gives flexibility in analysing the problem.
  4. Findings
  • Home Team performance and the number of fouls are crucial in determining how clubs rank in the final standings.
  5. Conclusion
  • Principal component analysis can sort out the winning elements and help club managers run their clubs.

2 Set Up

2.2 Data field legend

The Season 2017-2018 statistics were downloaded from DataHub. The data contain the results of 380 EPL matches. There are 22 variables in total, 12 of which measure game play statistics. The fields are described below.

Field information
Field Name  Order  Type (Format)      Description
Date        1      date (%Y-%m-%d)    Match Date (dd/mm/yy)
HomeTeam    2      string (default)   Home Team
AwayTeam    3      string (default)   Away Team
FTHG        4      integer (default)  Full Time Home Team Goals
FTAG        5      integer (default)  Full Time Away Team Goals
FTR         6      string (default)   Full Time Result (H=Home Win, D=Draw, A=Away Win)
HTHG        7      integer (default)  Half Time Home Team Goals
HTAG        8      integer (default)  Half Time Away Team Goals
HTR         9      string (default)   Half Time Result (H=Home Win, D=Draw, A=Away Win)
Referee     10     string (default)   Match Referee
HS          11     integer (default)  Home Team Shots
AS          12     integer (default)  Away Team Shots
HST         13     integer (default)  Home Team Shots on Target
AST         14     integer (default)  Away Team Shots on Target
HF          15     integer (default)  Home Team Fouls Committed
AF          16     integer (default)  Away Team Fouls Committed
HC          17     integer (default)  Home Team Corners
AC          18     integer (default)  Away Team Corners
HY          19     integer (default)  Home Team Yellow Cards
AY          20     integer (default)  Away Team Yellow Cards
HR          21     integer (default)  Home Team Red Cards
AR          22     integer (default)  Away Team Red Cards

2.3 Data exploration

The EPL has 20 clubs, and each club plays every other club twice in the season, once at its home stadium and once at its opponent's, for 38 games per club. The total number of records is therefore 20 x 19 = 380, each with 12 independent variables, which makes 4,560 data points. The analysis can easily be extended to other seasons. For simplicity, however, this study uses only Season 2017-2018.
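The counts in the paragraph above can be checked with a few lines of Python. This is only an arithmetic sketch; the constant names are illustrative and not part of the original analysis.

```python
from itertools import permutations

N_CLUBS = 20       # clubs in the league, from the text
N_VARIABLES = 12   # game play statistics per match, from the text

# Each ordered pair (home, away) of distinct clubs is one fixture.
fixtures = list(permutations(range(N_CLUBS), 2))

n_matches = len(fixtures)               # 20 * 19
games_per_club = 2 * (N_CLUBS - 1)      # home and away against every other club
n_data_points = n_matches * N_VARIABLES

print(n_matches, games_per_club, n_data_points)  # 380 38 4560
```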

Game play statistics are the independent variables that explain the game, i.e. the variance of the game, while the outputs are the game results. Only variables 11 to 22 are used for principal component analysis. Besides, the analysis is based on HomeTeam data; the AwayTeam analysis can be done in the same way.
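The summary below has one row per club, and its values look like season totals, so the match records were presumably summed per HomeTeam before the PCA. A minimal Python sketch of that grouping step follows, under that assumption. The match rows and the helper `totals_by` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical match rows for illustration only, not real 2017-2018 results.
matches = [
    {"HomeTeam": "Club A", "AwayTeam": "Club B", "HS": 15, "AS": 8, "HF": 10, "AF": 12},
    {"HomeTeam": "Club A", "AwayTeam": "Club C", "HS": 12, "AS": 11, "HF": 9, "AF": 14},
    {"HomeTeam": "Club B", "AwayTeam": "Club A", "HS": 7, "AS": 16, "HF": 13, "AF": 8},
]

STAT_FIELDS = ["HS", "AS", "HF", "AF"]  # subset of the 12 game play fields


def totals_by(rows, key):
    """Sum every statistic over all rows, grouped by the given team column."""
    totals = defaultdict(lambda: dict.fromkeys(STAT_FIELDS, 0))
    for row in rows:
        for field in STAT_FIELDS:
            totals[row[key]][field] += row[field]
    return dict(totals)


home_totals = totals_by(matches, "HomeTeam")
print(home_totals["Club A"])  # {'HS': 27, 'AS': 19, 'HF': 19, 'AF': 26}
```

Calling `totals_by(matches, "AwayTeam")` gives the corresponding table for the Away Team analysis.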

##        HS              AS             HST              AST      
##  Min.   :189.0   Min.   :106.0   Min.   : 55.00   Min.   :37.0  
##  1st Qu.:210.8   1st Qu.:182.0   1st Qu.: 65.75   1st Qu.:65.0  
##  Median :231.5   Median :217.0   Median : 73.00   Median :74.5  
##  Mean   :258.1   Mean   :206.2   Mean   : 87.90   Mean   :71.5  
##  3rd Qu.:291.5   3rd Qu.:232.0   3rd Qu.: 99.75   3rd Qu.:81.0  
##  Max.   :359.0   Max.   :264.0   Max.   :151.00   Max.   :97.0  
##        HF              AF              HC               AC        
##  Min.   :165.0   Min.   :152.0   Min.   : 80.00   Min.   : 47.00  
##  1st Qu.:179.2   1st Qu.:191.2   1st Qu.: 93.75   1st Qu.: 77.25  
##  Median :191.5   Median :201.0   Median :103.00   Median : 88.00  
##  Mean   :194.2   Mean   :199.1   Mean   :109.05   Mean   : 86.35  
##  3rd Qu.:209.0   3rd Qu.:213.5   3rd Qu.:126.25   3rd Qu.: 95.25  
##  Max.   :234.0   Max.   :237.0   Max.   :148.00   Max.   :128.00  
##        HY              AY              HR             AR     
##  Min.   :18.00   Min.   :16.00   Min.   :0.00   Min.   :0.0  
##  1st Qu.:23.75   1st Qu.:24.50   1st Qu.:0.00   1st Qu.:0.0  
##  Median :27.50   Median :31.00   Median :1.00   Median :1.0  
##  Mean   :28.10   Mean   :29.75   Mean   :0.85   Mean   :1.1  
##  3rd Qu.:31.75   3rd Qu.:35.25   3rd Qu.:1.25   3rd Qu.:2.0  
##  Max.   :41.00   Max.   :40.00   Max.   :3.00   Max.   :4.0

Some facts can be concluded from the boxplot:

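  1. The Home Team is more aggressive in attack. This is supported by higher Home Shots, Home Shots on Target and Home Corners values (means of 258.1, 87.90 and 109.05, against 206.2, 71.5 and 86.35 for the Away Team).
  2. Fouls are comparable between Home Team and Away Team (means of 194.2 and 199.1).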

2.3.1 Correlation plot

Some facts can be concluded from the correlation plot:

  1. The strongest negative correlation, -0.8, is between AS and HS. If one team controls the game, its attack is severe and can force the opponent onto the defensive.

  2. Shots on Target lead to more Corner Kicks, so AST is positively correlated with AC, at 0.81.

  3. Home Fouls are negatively correlated with Home Shots, which can be interpreted as better play going together with better sportsmanship.

  4. Away Fouls are not strongly correlated with other factors. The Red Card factor is comparable between Home Team and Away Team, which can be interpreted as a Red Card being an event unrelated to attack or defence statistics; it may be more of a referee-related issue.

  5. Home-related and Away-related factors are negatively correlated with each other, which is a reasonable representation.
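Each cell of a correlation plot is a Pearson correlation coefficient between two columns. The Python sketch below shows how such a coefficient is computed. The function name `pearson` and the five-club sample values are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical season totals for five clubs, for illustration only.
home_shots = [359, 300, 250, 210, 189]
away_shots = [106, 170, 200, 240, 264]

r = pearson(home_shots, away_shots)
print(round(r, 2))  # strongly negative for this made-up sample
```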

3 Methodology

The idea behind principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of correlated variables while preserving as much of the information present in the data set as possible. To achieve this, a new set of variables, the principal components (PCs), is constructed by transforming the original variables. The PCs are uncorrelated and sorted from the highest variance explained to the lowest. To illustrate, if there are 5 PCs, PC1 is the principal component that explains the most variance in the covariance matrix of the original variables.

For PCA, prcomp is used because, according to the literature2, prcomp uses singular value decomposition, which is generally the preferred method for numerical accuracy.
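To make the construction of the PCs concrete, here is a small, dependency-free Python sketch. It is not the prcomp implementation: instead of singular value decomposition it uses power iteration with deflation on the correlation matrix, which is easier to write without a linear algebra library. Standardizing the variables is an assumption. The function names and the five-club data are hypothetical.

```python
from math import sqrt


def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n, p = len(rows), len(rows[0])
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [sqrt(sum((v - m) ** 2 for v in c) / (n - 1)) for c, m in zip(cols, means)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)] for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    p = len(matrix)
    vec = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p))
    return value, vec


def pca(rows, n_components):
    """Eigenvalues and loading vectors of the first n_components PCs."""
    matrix = correlation_matrix(standardize(rows))
    p = len(matrix)
    pairs = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(matrix)
        pairs.append((value, vec))
        # Deflation: remove the component just found before the next search.
        matrix = [[matrix[i][j] - value * vec[i] * vec[j] for j in range(p)]
                  for i in range(p)]
    return pairs


# Hypothetical totals (shots for, shots against, fouls) for five clubs.
data = [[359, 106, 165], [300, 170, 180], [250, 200, 210],
        [210, 240, 190], [189, 264, 234]]
components = pca(data, n_components=3)
eigenvalues = [value for value, _ in components]
print([round(v, 3) for v in eigenvalues])
```

The eigenvalues come out in decreasing order and add up to the number of variables, which is the property the variance table in Section 3.1 relies on.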

3.1 Summary of PCA

## Importance of components:
##                          PC1    PC2    PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.427 1.1648 1.1065 1.03176 0.9992 0.74891 0.64172
## Proportion of Variance 0.491 0.1131 0.1020 0.08871 0.0832 0.04674 0.03432
## Cumulative Proportion  0.491 0.6041 0.7061 0.79483 0.8780 0.92476 0.95908
##                            PC8     PC9    PC10    PC11   PC12
## Standard deviation     0.47133 0.34027 0.27169 0.24999 0.1296
## Proportion of Variance 0.01851 0.00965 0.00615 0.00521 0.0014
## Cumulative Proportion  0.97759 0.98724 0.99339 0.99860 1.0000
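The three rows of the summary are linked: the variance of each PC is its standard deviation squared, and each proportion is that variance divided by the total. The Python sketch below recomputes the proportions from the standard deviations printed above. The variable names are illustrative.

```python
# Standard deviations of the 12 PCs, copied from the summary above.
sdev = [2.427, 1.1648, 1.1065, 1.03176, 0.9992, 0.74891,
        0.64172, 0.47133, 0.34027, 0.27169, 0.24999, 0.1296]

eigenvalues = [s ** 2 for s in sdev]   # variance of each PC
total_variance = sum(eigenvalues)      # close to 12, the number of variables

proportion = [ev / total_variance for ev in eigenvalues]

cumulative = []
running = 0.0
for share in proportion:
    running += share
    cumulative.append(running)

print(round(total_variance, 2))   # 12.0
print(round(proportion[0], 3))    # 0.491
print(round(cumulative[4], 3))    # 0.878
```

A total close to 12 is what one would expect if the 12 variables were standardized before the PCA.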

3.2 PCA results

3.2.3 Eigenvalue analysis

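According to the Kaiser rule, a PC with an eigenvalue below 1 explains less variance than one original variable. The eigenvalue of PC5 (0.9992 squared, about 0.998) is essentially at this threshold, while the eigenvalues beyond PC5 are significantly less than 1. Therefore, a total of 5 PCs will be used.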
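The selection can be reproduced from the standard deviations in Section 3.1. In the Python sketch below, the strict form of the rule (eigenvalue greater than 1) keeps four PCs, because PC5 falls marginally short. The small tolerance that keeps PC5 is an assumption made here for illustration and is not a value stated in the analysis.

```python
# Standard deviations of the 12 PCs, copied from the summary in Section 3.1.
sdev = [2.427, 1.1648, 1.1065, 1.03176, 0.9992, 0.74891,
        0.64172, 0.47133, 0.34027, 0.27169, 0.24999, 0.1296]
eigenvalues = [s ** 2 for s in sdev]

# Strict Kaiser rule: keep PCs whose eigenvalue exceeds 1.
strict = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

# PC5 has eigenvalue 0.9992 ** 2, just below 1, so a small margin keeps it.
TOLERANCE = 0.01  # assumed margin, not taken from the report
relaxed = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0 - TOLERANCE]

print(strict)    # [1, 2, 3, 4]
print(relaxed)   # [1, 2, 3, 4, 5]
```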

3.2.4 Contribution for loadings of each PC

##        Dim.1    Dim.2    Dim.3    Dim.4    Dim.5
## HS  14.38937  1.51669  0.02446  0.53439  0.78434
## AS  14.13506  1.40576  0.00005  0.01467  5.86522
## HST 13.50251  2.83039  0.12417  0.37533  2.69839
## AST 13.75937  0.16770  0.56088  0.12155  0.95727
## HF   6.36282 15.78376  0.77463  0.30650 13.88877
## AF   0.46109  5.30076 53.92940  1.46386 14.90660
## HC  14.03100  1.98957  0.09251  0.65872  0.00659
## AC  12.42937  3.62713  2.10217  1.06416  1.07972
## HY   6.87327  0.42215  2.87897  6.26729 37.96259
## AY   3.33618 21.41235 32.88986  0.66215  0.19191
## HR   0.07654  0.63199  1.02227 88.50718  1.05281
## AR   0.64341 44.91173  5.60064  0.02419 20.60579

The level of 1/12, or about 8.33%, is represented by the red dashed line: it is the contribution each of the 12 variables would make if all contributed equally.
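With 12 variables, an equal share of the contribution would be 100/12 percent each, and variables above that level are the ones that characterize a PC. The Python sketch below applies this cut to the first two columns of the table above. The helper name `above_threshold` is illustrative.

```python
# Contributions (%) to Dim.1 and Dim.2, copied from the table above.
contrib = {
    "HS":  (14.38937, 1.51669),
    "AS":  (14.13506, 1.40576),
    "HST": (13.50251, 2.83039),
    "AST": (13.75937, 0.16770),
    "HF":  (6.36282, 15.78376),
    "AF":  (0.46109, 5.30076),
    "HC":  (14.03100, 1.98957),
    "AC":  (12.42937, 3.62713),
    "HY":  (6.87327, 0.42215),
    "AY":  (3.33618, 21.41235),
    "HR":  (0.07654, 0.63199),
    "AR":  (0.64341, 44.91173),
}

threshold = 100 / len(contrib)  # equal share per variable, about 8.33%


def above_threshold(dim):
    """Variables whose contribution to the given dimension exceeds the equal share."""
    return [name for name, values in contrib.items() if values[dim] > threshold]


print(round(threshold, 2))   # 8.33
print(above_threshold(0))    # ['HS', 'AS', 'HST', 'AST', 'HC', 'AC']
print(above_threshold(1))    # ['HF', 'AY', 'AR']
```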

PC1 is attack related: the high contributions come from Home Team Shots, Away Team Shots, Home Team Shots on Target, Away Team Shots on Target, Home Team Corners and Away Team Corners. PC1 explains 49.10% of the variance. This is reasonable, as attack is the best way to win a game in soccer and therefore to explain the game.

PC2 is a referee statistic; it seems to be a referee-related issue since both Home and Away Teams are involved. The major contributions are Home Team Fouls Committed, Away Team Yellow Cards and Away Team Red Cards. PC2 explains 11.31% of the variance.

PC3 represents Away Team foul play. The high contributions are Away Team Fouls Committed and Away Team Yellow Cards. PC3 explains 10.20% of the variance.

PC4 represents Home Team Red Cards. PC4 explains 8.87% of the variance.

PC5 seems to represent Home Team advantage. The contributions of Home Team Fouls Committed and Away Team Fouls Committed are similar (13.9% and 14.9%), but the card contributions diverge: Home Team Yellow Cards (38.0%) and Away Team Red Cards (20.6%) are high, while Home Team Red Cards (1.1%) and Away Team Yellow Cards (0.2%) are negligible. PC5 explains 8.32% of the variance. The correlation circle in Section 4.1 provides a graphical representation of this PC.

4 Findings

4.1 Correlation circle

The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: The observations are represented by their projections, but the variables are represented by their correlations (Abdi and Williams 2010).
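For a PCA on standardized variables, the correlation between a variable and a PC equals the variable's loading times the standard deviation of that PC. The Python sketch below recovers the absolute coordinates of three variables from numbers reported earlier. It assumes the contributions in Section 3.2.4 are 100 times the squared loadings, which is consistent with each column summing to 100. Signs cannot be recovered this way, and the function name is illustrative.

```python
from math import sqrt

SDEV_PC1, SDEV_PC2 = 2.427, 1.1648   # from the PCA summary in Section 3.1

# Contributions (%) to Dim.1 and Dim.2 for three variables, from Section 3.2.4.
contrib = {
    "HS": (14.38937, 1.51669),
    "HF": (6.36282, 15.78376),
    "AR": (0.64341, 44.91173),
}


def circle_coordinates(name):
    """Absolute correlations of a variable with PC1 and PC2."""
    c1, c2 = contrib[name]
    return sqrt(c1 / 100) * SDEV_PC1, sqrt(c2 / 100) * SDEV_PC2


for name in contrib:
    x, y = circle_coordinates(name)
    # Every variable lies inside the unit circle, so x**2 + y**2 is at most 1.
    print(name, round(x, 2), round(y, 2), round(x * x + y * y, 2))
```

Under these assumptions, the closer a variable is to the edge of the circle, the better it is represented by the first two PCs.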

From the plot, positively correlated variables group together. The ‘Attack’ attributes are grouped together.

One interesting observation is that home team fouls are positively related to yellow cards but negatively related to red cards, while the reverse holds for away teams. This can easily be related to home advantage. Referees may be more inclined to give minor punishments to the home team, while away teams have a higher chance of getting red cards for foul play. This strongly supports the interpretation of PC5.

4.3 Biplot

A biplot combines the row data with the PCs. It visualizes the data by assigning PC1 and PC2 to the X and Y axes of a scatter chart, as below.

The blue arrows starting from the origin are the variables. Each club, coming from the original rows, is shown as a dot.

From the analysis, below are some major findings.

  1. Manchester City and Chelsea are high in PC1, which translates to good home performance. Chelsea finished 5th and Manchester City were champions, so strong home performance is a must for winning the league.

  2. A low number of fouls separates the clubs in the ranking. The left-side clubs, Liverpool, Tottenham, Manchester City, Chelsea and Arsenal, are all in the top 6 of the final ranking. Manchester United seems to be an outlier in the elite group, as it was the runner-up but is located in the middle of PC1. Clubs in the north-east region of the plot, including West Bromwich Albion, Swansea City and Stoke City, are high in PC2, which means more fouls. Stoke City also had high Home Yellow Cards. Besides, these three teams are not good in either Home or Away attack. In the final ranking, these three teams performed worst and were relegated for Season 2018-2019.

4.4 Loadings

4.6 Mean of PC1 and PC2

4.8 Away Team Analysis

For completeness, the Away Team PCA is carried out below.

##        HS              AS             HST              AST        
##  Min.   :132.0   Min.   :122.0   Min.   : 47.00   Min.   : 35.00  
##  1st Qu.:230.8   1st Qu.:169.8   1st Qu.: 79.00   1st Qu.: 60.75  
##  Median :265.5   Median :199.0   Median : 90.00   Median : 68.50  
##  Mean   :258.1   Mean   :206.2   Mean   : 87.90   Mean   : 71.50  
##  3rd Qu.:299.5   3rd Qu.:241.0   3rd Qu.: 96.75   3rd Qu.: 87.25  
##  Max.   :335.0   Max.   :317.0   Max.   :123.00   Max.   :110.00  
##        HF              AF              HC              AC        
##  Min.   :141.0   Min.   :172.0   Min.   : 56.0   Min.   : 55.00  
##  1st Qu.:186.8   1st Qu.:182.5   1st Qu.: 89.0   1st Qu.: 72.00  
##  Median :194.5   Median :201.0   Median :113.0   Median : 86.00  
##  Mean   :194.2   Mean   :199.1   Mean   :109.0   Mean   : 86.35  
##  3rd Qu.:202.0   3rd Qu.:215.2   3rd Qu.:127.2   3rd Qu.:102.00  
##  Max.   :232.0   Max.   :226.0   Max.   :153.0   Max.   :146.00  
##        HY              AY              HR             AR      
##  Min.   :13.00   Min.   :17.00   Min.   :0.00   Min.   :0.00  
##  1st Qu.:24.75   1st Qu.:25.50   1st Qu.:0.00   1st Qu.:0.75  
##  Median :27.00   Median :31.00   Median :1.00   Median :1.00  
##  Mean   :28.10   Mean   :29.75   Mean   :0.85   Mean   :1.10  
##  3rd Qu.:34.25   3rd Qu.:34.25   3rd Qu.:1.00   3rd Qu.:2.00  
##  Max.   :40.00   Max.   :38.00   Max.   :3.00   Max.   :3.00

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     2.3100 1.4542 1.1921 0.96504 0.81775 0.76849
## Proportion of Variance 0.4447 0.1762 0.1184 0.07761 0.05573 0.04922
## Cumulative Proportion  0.4447 0.6209 0.7393 0.81692 0.87264 0.92186
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.59752 0.53131 0.38977 0.30503 0.22007 0.07058
## Proportion of Variance 0.02975 0.02352 0.01266 0.00775 0.00404 0.00042
## Cumulative Proportion  0.95161 0.97514 0.98780 0.99555 0.99958 1.00000

Huddersfield Town3 is an outlier. Its final standing was not high, but it had good home performance. Part of the reason may be its turnaround matches in the League against big clubs like Chelsea and Manchester United.

5 Conclusion

Principal component analysis can sort out the winning elements and help club managers run their clubs.

Principal component analysis can represent information in a lower dimension, which makes the analysis easier to handle and helps uncover different aspects of the factors. If analysts focus on the first few PCs, they can build a model with better performance.

6 References

  1. Applied Multivariate Statistical Analysis, 5th ed., Richard Johnson and Dean Wichern, Prentice Hall.
  2. Principal Component Analysis (PCA) 101, using R, Peter Nistrup, https://towardsdatascience.com/principal-component-analysis-pca-101-using-r-361f4c53a9ff
  3. Principal Component Analysis in R: prcomp vs princomp, Statistical tools for high-throughput data analysis, http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
  4. How to rate the performance of a soccer team? An application of Principal Components Analysis, https://fcostartistician.wordpress.com/2017/05/22/how-to-rate-the-performance-of-a-soccer-team-an-application-of-principal-components-analysis/
  5. Principal Component Analysis (PCA) with FactoMineR (decathlon dataset), François Husson & Magalie Houée-Bigot, http://factominer.free.fr/course/doc/RMarkdown_PCA_Decathlon.pdf
  6. Principal Component Analysis, 2nd ed., I.T. Jolliffe, Springer.